Foundations of CUDA Kernel Development
AI021 Lesson 2
00:00

CUDA kernel development begins with the definition of a kernel: a specialized C++ function designed to execute in parallel across the massive core count of an NVIDIA GPU. Kernels are the fundamental unit of work in the CUDA programming model, acting as the bridge where serial host logic transitions into massively parallel device execution.

1. The __global__ Specifier

The __global__ declaration specifier instructs the compiler to generate device code for the GPU while keeping the function's entry point callable from the CPU. Functions that execute on the GPU but can be invoked from the host are called kernels.
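A minimal sketch of such a kernel might look like the following (the function name and message are illustrative, not from the lesson):

```cuda
#include <cstdio>

// __global__ marks this function as a kernel: it is compiled for the
// device, but its entry point remains callable from host code.
__global__ void helloKernel() {
    printf("Hello from the GPU\n");
}

int main() {
    // Launch configuration: 1 block of 4 threads (covered later).
    helloKernel<<<1, 4>>>();
    cudaDeviceSynchronize();  // wait for the asynchronous kernel to finish
    return 0;
}
```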

2. Execution Environment

Kernels are dispatched to and executed on Streaming Multiprocessors (SMs). The SM is the primary computational engine within an NVIDIA GPU, responsible for managing hundreds of concurrent threads. Each SM handles blocks of threads and schedules them onto its processing cores.
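Because each SM executes whole blocks of threads, a thread typically computes its own global index from its block and thread coordinates to select its portion of the work. A minimal sketch, with illustrative names:

```cuda
// Each thread derives a unique global index and processes one element.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may be only partially full
        data[i] *= 2.0f;
    }
}
```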

Syntax Rule: Kernels must return void. Because they execute asynchronously with respect to the host, they cannot return a value directly to the CPU; instead, they must write their results back to allocated device memory.
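Putting that rule into practice, a sketch of the full round trip: the kernel returns void and writes its result into device memory, which the host then copies back (kernel and variable names are illustrative):

```cuda
#include <cstdio>

// The kernel returns void; the sum is written to device memory.
__global__ void addKernel(const int *a, const int *b, int *out) {
    *out = *a + *b;
}

int main() {
    int a = 2, b = 3, result = 0;
    int *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    addKernel<<<1, 1>>>(d_a, d_b, d_out);

    // cudaMemcpy waits for the kernel to finish before copying back.
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("result = %d\n", result);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```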

[Diagram: a kernel launch flowing from the Host (CPU) to Streaming Multiprocessors (SMs) on the Device (NVIDIA GPU)]